knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(pROC)
library(MLmetrics)
library(fastDummies)
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds")
The goal of this project is to predict customer churn in a bank using machine learning. The workflow covers feature engineering, model specification, training, and evaluation to identify the best-performing model for predicting churn.
# Create additional features
banko <- bank %>%
mutate(age2 = Customer_Age^2) %>%
select(Customer_Age, age2, Dependent_count, Churn)
# Dummy encode categorical variables and apply PCA
bank = read_rds("/Users/Shared/Data 505/BankChurners.rds") %>%
mutate(Churn = Churn == "yes") %>%
dummy_cols(remove_selected_columns = TRUE)
pr_bank = prcomp(select(bank, -Churn), scale = TRUE, center = TRUE)
screeplot(pr_bank, type = "lines")
# A tibble: 6 × 5
Churn Gender Card_Category Income_Category Credit_Limit
<lgl> <dbl> <dbl> <dbl> <dbl>
1 FALSE 1.50 2.38 1.21 0.897
2 FALSE -1.36 -0.653 1.52 1.46
3 FALSE 0.943 2.25 2.38 2.29
4 FALSE -2.50 -0.208 2.35 1.39
5 FALSE 0.841 2.14 3.82 0.559
6 FALSE -0.115 2.22 0.918 0.721
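The screeplot is used to judge how many principal components to retain, but the PCA scores are never fed into the models below. A minimal sketch of how they could be, assuming (for illustration only) that the first five components are kept — the actual number should be read off the screeplot:

```r
# Sketch (not part of the original pipeline): keep the first five
# principal-component scores as features and reattach the outcome.
pc_bank <- as_tibble(pr_bank$x[, 1:5]) %>%
  mutate(Churn = bank$Churn)
head(pc_bank)
```

A model trained on `pc_bank` would actually use the dimensionality reduction computed above, rather than discarding it.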
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(504)
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]
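Because `createDataPartition` stratifies on the outcome, both sets should carry roughly the same churn rate; a quick check confirms this:

```r
# Churn prevalence should be nearly identical in train and test
prop.table(table(train$Churn))
prop.table(table(test$Churn))
```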
# Train Random Forest model
fit <- train(Churn ~ .,
data = train,
method = "rf",
ntree = 20,
tuneLength = 3,
metric = "ROC",
trControl = ctrl)
note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
Random Forest
8102 samples
3 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 5401, 5402, 5401
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.4945632 0.9995588 0.0000000000
3 0.4953395 0.9988237 0.0007680492
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
Confusion Matrix and Statistics
Reference
Prediction no yes
no 1700 325
yes 0 0
Accuracy : 0.8395
95% CI : (0.8228, 0.8552)
No Information Rate : 0.8395
P-Value [Acc > NIR] : 0.5148
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 1.0000
Specificity : 0.0000
Pos Pred Value : 0.8395
Neg Pred Value : NaN
Prevalence : 0.8395
Detection Rate : 0.8395
Detection Prevalence : 1.0000
Balanced Accuracy : 0.5000
'Positive' Class : no
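The confusion matrix shows a degenerate classifier: it predicts "no" for every observation (specificity 0, Kappa 0), a common symptom of the roughly 84/16 class imbalance. One standard remedy in caret is to down-sample the majority class inside each resampling fold via `sampling = "down"` in `trainControl`; a sketch under that approach:

```r
# Sketch: down-sample the majority class within each CV fold so the
# forest trains on balanced classes
ctrl_down <- trainControl(method = "cv", number = 3,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          sampling = "down")

set.seed(504)
fit_down <- train(Churn ~ .,
                  data = train,
                  method = "rf",
                  ntree = 20,
                  metric = "ROC",
                  trControl = ctrl_down)
```

Up-sampling (`sampling = "up"`) or class weights are alternatives; whichever is used, accuracy should not be the headline metric at this prevalence, since always predicting "no" already scores 0.84.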
  mtry
2    3
set.seed(1504)
bank_index <- createDataPartition(banko$Churn, p = 0.80, list = FALSE)
train <- banko[bank_index, ]
test <- banko[-bank_index, ]
# Re-fit model using best hyperparameters
fit_final <- train(Churn ~ .,
data = train,
method = "rf",
tuneGrid = fit$bestTune,
metric = "ROC",
trControl = ctrl)
myRoc <- roc(test$Churn, predict(fit_final, test, type = "prob")[, 2])
plot(myRoc)
Area under the curve: 0.4861
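An AUC of about 0.49 sits at the 0.5 chance line, so the model has no real discriminative power here. As a minor robustness tweak, the positive-class probability column can be selected by name rather than by position, which guards against factor levels being ordered differently:

```r
# Select the "yes" probability column by name instead of [, 2]
probs <- predict(fit_final, test, type = "prob")[["yes"]]
myRoc <- roc(test$Churn, probs)
auc(myRoc)
```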
The results show that the three engineered features (Customer_Age, its square, and Dependent_count) carry little signal for churn: the Random Forest achieved an AUC of roughly 0.49 on the held-out set, no better than chance, and defaulted to predicting the majority "no" class for essentially every customer (specificity near 0, Kappa of 0). Note also that the PCA on the dummy-encoded data was computed but never fed into the model, so its potential contribution remains untested.
Further work should start by closing those gaps: train on the principal components (or the full feature set) rather than the three-variable subset, address the roughly 84/16 class imbalance (for example via down-sampling, up-sampling, or class weights), and widen the hyperparameter search beyond the truncated default grid. Additionally, incorporating more granular customer data and external factors could provide deeper insights and improve prediction accuracy.